Document Summary:
This is an R notebook for the Fundulus heteroclitus clinal genetics analysis.

Top-level summaries of methods and results are given in the main text. A more detailed explanation of each step is provided as comments in a header or footer within each code chunk.

Code chunks in R, Python, and bash are generally run locally, while shell scripts are nearly always run remotely on the Clark University high performance computing cluster.

Project Rationale:
Combine raw data from all F. heteroclitus GBS sequencing projects in the Oleksiak/Crawford lab into a single genotyping run. Then use these data to describe the population genetic structure of the entire F. heteroclitus species range and to parse adaptive genetic variation from neutral structure owing to secondary contact or isolation by distance.

Analysis Outline

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      tab6 [label = '@@6']
      tab7 [label = '@@7']
      tab8 [label = '@@8']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5 -> tab6 -> tab7
      tab6 -> tab8
      }

      [1]: 'library prep (Elshire GBS)'
      [2]: 'sequencing (illumina hiseq2000)'
      [3]: 'qc checks and datawrangling (fastqc and bash)'
      [4]: 'demultiplex and read filtering (stacks: process_radtags)'
      [5]: 'read mapping (bwa_mem)'
      [6]: 'genotype likelihoods (ANGSD)'
      [7]: 'demography and population structure'
      [8]: 'selection'
      
      ")

Sequencing and Genotyping

Genotyping-by-sequencing libraries are produced according to Elshire et al. 2011 and sequenced using Illumina HiSeq. After initial quality control, the reads are cleaned and assigned to individual fish using the process_radtags program from STACKS. The cleaned reads are mapped to the Fundulus heteroclitus genome using BWA-MEM. The resulting BAM files are then used as input to calculate genotype likelihoods in ANGSD.

Population Genetics

All population genetic analyses are conducted from the genotype likelihoods. A summary of these methods is available in the relevant sections below.

Data description and QC

Populations and Sites

Samples

sites<-read.table("./sites/sites.txt", header = T, sep = "\t")
kable(sites)
Population_Full_Name Abbreviation N Longitude Latitude
Brayton Point, MA BP 50 -71.18604 41.71250
Clinton, CT CT 24 -72.52448 41.27902
Sapelo Island Ferry Dock, Brunswick, GA GA 41 -81.35452 31.44932
Horseneck Beach, MA HB 50 -71.02556 41.50449
Kings Creek, Severn, VA KC 25 -76.42462 37.29845
Mattapoisett, MA M 31 -70.81464 41.65875
Mount Desert Island Bio Lab MDIBL 24 -68.29441 44.42901
Wiscasset, ME ME 27 -69.66481 43.99695
Mantoloking, NJ MG 331 -74.07045 40.05046
New Bedford Harbor NBH, F, H, P, Syc 150 -70.90760 41.62486
Manteo, NC NC 24 -75.66737 35.90029
Oyster Creek, NJ OC 47 -74.18475 39.80866
Rutgers, Tuckerton, NJ RB 423 -74.32448 39.50878
Charleston, SC SC 26 -79.91624 32.75873
Stone Harbor SH 62 -74.76492 39.05666
Slocum River, MA SLC 30 -70.97109 41.52257
Succotash Marsh, Matunuck, RI SR 49 -71.52557 41.38235
Wilmington, NC (Carolina Beach State Park) WNC 30 -77.91932 34.04998
grandis grandis 34 -84.17740 30.07200
#chunk to pull map

Sequence Data Description and Challenges

The clinal dataset is composed of 9 lanes of HiSeq2500 run in SE100. These reads cover 1478 individuals gathered from 5 separate experiments and an additional set of sequencing libraries used for regions undersampled in the previous experiments (CT, NC, SC and ME). All lanes use the same cut sites but vary in depth and quality of library preparation (some libraries were made from very old samples).

This variation in sequencing depth and quality produced serious issues when calling genotypes using TASSEL. Because the analysis combines pre-existing experiments, populations were not distributed across library preparations and sequencing runs, leaving variation in sequencing depth confounded with population. There was a strong inverse correlation between population-level sequencing depth and observed heterozygosity, suggesting that heterozygotes were falsely called as homozygotes in undersampled populations. To ameliorate the variation in library prep and sample quality, this analysis does not rely on called genotypes. Instead we use ANGSD to generate genotype likelihoods and use these likelihoods to conduct all downstream analyses.
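As a back-of-the-envelope illustration of this depth-heterozygosity confound (a sketch, not part of the pipeline): at depth n, a true heterozygote yields reads from only one of its two alleles (and would therefore be miscalled homozygous by a naive caller) with probability 2*(1/2)^n, ignoring sequencing error.

```python
def p_het_miscall(depth):
    """Probability that a true heterozygote shows reads from only one of its
    two alleles when `depth` reads are drawn independently (each allele
    sampled with probability 1/2), ignoring sequencing error. Both the
    all-reference and all-alternate outcomes count."""
    return 2 * 0.5 ** depth

for d in (2, 4, 8, 16):
    print(d, p_het_miscall(d))
```

At 2x depth half of all heterozygous sites are expected to look homozygous, which is exactly the pattern seen in the undersampled populations.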

Other atypical challenges associated with this dataset are the inclusion of the sister species F. grandis in the sequencing data and unequal sampling across space and (potentially) lineages.

Lane names:

H7T6MADXX_1 H7T6MADXX_2 H9AREADXX_1 H9AREADXX_2 H9CWFADXX_1 H9WYGADXX_1 H9WYGADXX_2 HMMKLADXX_1 HMMKLADXX_2

QC reports

Here we run sequencing-level QC reports for each lane using FastQC before beginning the analysis.


#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set the number of cpus
#SBATCH --cpus-per-task=6

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-02:00:00

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of job
#SBATCH --job-name=fastqc

# set name of output file
#SBATCH --output=fastqc.out

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu

files="H7T6MADXX_1
H7T6MADXX_2
H9AREADXX_1
H9AREADXX_2
H9CWFADXX_1
H9WYGADXX_1
H9WYGADXX_2
HMMKLADXX_1
HMMKLADXX_2
"

for file in $files
do
/home/jgibbons/SOFTWARE/FastQC/fastqc -t 4 /home/ddayan/fundulus/seqdata/${file}_fastq.gz -o /home/ddayan/fundulus/seqdata/fastqc

done

FastQC reports look good for all lanes.

Demultiplex and Read Filtering

We will use ANGSD to create population-level allele frequency estimates rather than call individual genotypes. The input files for ANGSD are individual-level BAM files. Therefore, the first step after the FastQC check is to demultiplex (assign reads to an individual), remove low-quality reads, and trim adapter and barcode sequences. This is accomplished with the process_radtags program from Stacks v2.3.

Generate Barcodes

Barcode keys (files used to demultiplex sequence data) are a bit complex because some fish are sequenced twice and Stacks requires each lane to be run separately through process_radtags. The solution taken here is to give duplicate fish a unique ID suffix with a shared prefix; the cleaned, demultiplexed reads from each fish in each lane are then combined with their additional reads using prefix matching.
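The renaming scheme can be sketched in Python (the notebook itself uses R's make.names(unique = TRUE); this toy function mimics its suffixing behavior):

```python
from collections import Counter

def make_unique(ids):
    """Mimic R's make.names(..., unique = TRUE): the first occurrence keeps
    its name; later duplicates get a .1, .2, ... suffix, so all copies of a
    fish share a common prefix usable for shell glob matching."""
    seen = Counter()
    out = []
    for name in ids:
        n = seen[name]
        out.append(name if n == 0 else f"{name}.{n}")
        seen[name] += 1
    return out

print(make_unique(["BP_406", "Hb_401", "BP_406"]))
```

A duplicated fish becomes e.g. BP_406 and BP_406.1, so the shell glob BP_406* later gathers both copies.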

#Here we make new fish_ids, based on only population and library prep id, then mark duplicates (duplicate fish_ids occur because some individual fish are sequenced twice i.e. in different lanes)

barcodes<-read.table("./snp_calling/barcode_keys/cline_barcode_key.txt", sep = "\t", header = T)
tmp<-paste(barcodes$Pop , "_", barcodes$LibraryPrepID, sep = "")
barcodes$fish_id<-make.names(tmp, unique = TRUE)
write.csv(barcodes, row.names = FALSE, file = "./snp_calling/barcode_keys/cline_barcode_key.csv", quote = FALSE)
# used a python script on hand for a slightly different format: create a single unique variable for sequencing lane (i.e. column-bind flowcell and lane with a "_"), then run the python script below

""" The input file has four columns, this script takes writes columns 2 and 3 (barcode and individual) to a new file based on the value of column 4."""

import csv

with open('./snp_calling/barcode_keys/cline_barcode_key.csv') as fin:    
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Flowcell_Lane']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('./snp_calling/barcode_keys/{}_key.csv'.format(cat), 'w')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)
    # Close all the files
    for fout, _ in outputs.values():
        fout.close()
# Oops, only meant to keep barcode and sample id and need to write to tab delimited file instead. 

for i in ./snp_calling/barcode_keys/*csv
do
  cut -d "," -f 2,10 $i > ${i%.csv}.tmp
done


for i in ./snp_calling/barcode_keys/*tmp
do
    tr "," "\\t" < $i > ${i%.tmp}_barcodes.txt
done
#clean up directory
rm ./snp_calling/barcode_keys/*.tmp
rm ./snp_calling/barcode_keys/*[12]_key.csv
#remove first line from all files
sed -i -e "1d" ./meta_data/*_barcodes.txt

process_radtags

Here we filter and demultiplex the reads using stacks process_radtags

process_radtags options:

-f input file (single-end reads)
-c remove any read with an uncalled base
-q remove any read with low quality
-r rescue barcodes
--inline-null barcodes are inline, no index barcode
-e aseI (restriction enzyme)
--adapter-1 trim adapter sequence (AGATCGGAAGAGCCGTTCAGCAGGAATGCCGAGACCGATCTCG, the common adapter from Elshire et al. 2011)
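A hypothetical, highly simplified sketch of the per-read logic (the real work is done by process_radtags; the toy barcode map and the assumed AseI cut-site remnant "TAAT" are illustrative, not taken from the project files):

```python
# Simplified demultiplexing sketch: match an inline barcode, verify the
# enzyme cut-site remnant that should immediately follow it (assumed here
# to be "TAAT" for AseI), and trim the barcode off the retained read.
BARCODES = {"ACGT": "BP_406", "TGCA": "Hb_401"}   # toy barcode -> sample map
CUTSITE = "TAAT"  # assumed AseI remnant; an assumption, not from the notebook

def demultiplex(read):
    for bc, sample in BARCODES.items():
        if read.startswith(bc) and read[len(bc):len(bc) + len(CUTSITE)] == CUTSITE:
            return sample, read[len(bc):]          # drop barcode, keep cut site
    return None, None                              # "barcode not found" drop

print(demultiplex("ACGTTAATGGGCCC"))
```

Reads failing the barcode or cut-site check end up in the "barcode not found" and "RAD cutsite not found" drop categories tallied below.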

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of output file
#SBATCH --output=library_%a_process_radtags.out

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL

# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu


source /opt/stacks-2.3/bin/source_me
/opt/stacks-2.3/bin/process_radtags -f ./seqdata/HMMKLADXX_2_fastq.gz -b ./meta_data/HMMKLADXX_2_key_barcodes.txt -o ./cleaned_tags -e aseI --inline-null -c -q -r --adapter-1 AGATCGGAAGAGCCGTTCAGCAGGAATGCCGAGACCGATCTCG  &> ./cleaned_tags/pr_library_HMMKLADXX_2.oe

#this script is called for each lane - not included in R notebook

Summary of process_radtags:

A summary of the number of reads for each lane is in the code chunk below

H7T6MADXX_1  
162739994 total sequences   
 13628228 reads contained adapter sequence (8.4%)  
   312933 barcode not found drops (0.2%)  
   828040 low quality read drops (0.5%)  
 14062227 RAD cutsite not found drops (8.6%)  
133908566 retained reads (82.3%)  
  
H7T6MADXX_2  
151505038 total sequences  
 13951706 reads contained adapter sequence (9.2%)  
   132166 barcode not found drops (0.1%)  
   584797 low quality read drops (0.4%)  
  8662501 RAD cutsite not found drops (5.7%)  
128173868 retained reads (84.6%)  
  
H9AREADXX_1  
148867111 total sequences   
  4339944 reads contained adapter sequence (2.9%)  
   306024 barcode not found drops (0.2%)  
   305700 low quality read drops (0.2%)  
  5881380 RAD cutsite not found drops (4.0%)  
138034063 retained reads (92.7%)  
  
H9AREADXX_2  
122597494 total sequences  
   720840 reads contained adapter sequence (0.6%)  
   185795 barcode not found drops (0.2%)  
   198200 low quality read drops (0.2%)  
  8394213 RAD cutsite not found drops (6.8%)  
113098446 retained reads (92.3%)  
  
H9CWFADXX_1  
110486132 total sequences  
 13263006 reads contained adapter sequence (12.0%)  
    98479 barcode not found drops (0.1%)  
   775153 low quality read drops (0.7%)  
  4943084 RAD cutsite not found drops (4.5%)  
 91406410 retained reads (82.7%)  
  
H9WYGADXX_1  
138847370 total sequences  
  3202776 reads contained adapter sequence (2.3%)  
   846988 barcode not found drops (0.6%)  
   267348 low quality read drops (0.2%)  
  3683206 RAD cutsite not found drops (2.7%)  
130847052 retained reads (94.2%)  
  
H9WYGADXX_2  
159871680 total sequences  
  5477072 reads contained adapter sequence (3.4%)  
  1301582 barcode not found drops (0.8%)  
   294833 low quality read drops (0.2%)  
  6884903 RAD cutsite not found drops (4.3%)  
145913290 retained reads (91.3%)  
  
HMMKLADXX_1  
147179868 total sequences  
  6490480 reads contained adapter sequence (4.4%)  
    82644 barcode not found drops (0.1%)  
   940503 low quality read drops (0.6%)  
 10192934 RAD cutsite not found drops (6.9%)  
129473307 retained reads (88.0%)  
  
HMMKLADXX_2  
149288526 total sequences  
  6056602 reads contained adapter sequence (4.1%)  
    71084 barcode not found drops (0.0%)  
  1064401 low quality read drops (0.7%)  
 12505196 RAD cutsite not found drops (8.4%)  
129591243 retained reads (86.8%)  

Sum of all lanes
1,291,383,213 total sequences
67,130,654 reads contained adapter sequence (5.2%)
3,337,695 barcode not found drops (0.3%)
5,258,975 low quality read drops (0.4%)
75,209,644 RAD cutsite not found drops (5.8%)
1,140,446,245 retained reads (88.3%)
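A quick arithmetic check of the tallies above (numbers copied from two of the lanes): adapter, barcode, quality, and cut-site drops plus retained reads should account for every sequence in a lane.

```python
# Consistency check on the per-lane process_radtags tallies: every read is
# either dropped (adapter, missing barcode, low quality, missing cut site)
# or retained, so the categories must sum to the lane total.
lanes = {
    "H7T6MADXX_1": dict(total=162739994, adapter=13628228, barcode=312933,
                        lowq=828040, cutsite=14062227, retained=133908566),
    "HMMKLADXX_2": dict(total=149288526, adapter=6056602, barcode=71084,
                        lowq=1064401, cutsite=12505196, retained=129591243),
}
for name, c in lanes.items():
    drops = c["adapter"] + c["barcode"] + c["lowq"] + c["cutsite"]
    assert drops + c["retained"] == c["total"]
    # retained percentage, matching the figures reported above
    print(name, round(100 * c["retained"] / c["total"], 1))
```

Both lanes balance exactly, and the recomputed retention rates (82.3% and 86.8%) match the reported values.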

combine duplicate cleaned tags

Here we concatenate fastq files from fish with reads across multiple lanes based on prefix matching.

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=1

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of output file
#SBATCH --output=

# mail alert at start, end and abortion of execution
#SBATCH --mail-type=NONE

prefixes="BP_406
BP_407
BP_408
BP_409
BP_410
BP_436
BP_437
BP_438
BP_439
BP_440
BP_466
BP_467
BP_468
BP_469
BP_470
BP_496
BP_497
BP_498
BP_499
BP_500
BP_526
BP_527
BP_528
BP_529
BP_530
BP_556
BP_557
BP_558
BP_559
BP_560
BP_586
BP_587
BP_588
BP_589
BP_590
BP_616
BP_617
BP_618
BP_619
BP_620
BP_646
BP_647
BP_648
BP_649
BP_650
BP_676
BP_677
BP_678
BP_679
BP_680
Hb_401
Hb_402
Hb_403
Hb_404
Hb_405
Hb_432
Hb_433
Hb_434
Hb_463
Hb_464
Hb_465
Hb_491
Hb_492
Hb_493
Hb_495
Hb_521
Hb_522
Hb_523
Hb_524
Hb_525
Hb_552
Hb_555
Hb_581
Hb_582
Hb_583
Hb_584
Hb_585
Hb_612
Hb_613
Hb_615
Hb_641
Hb_645
Hb_671
Hb_672
Hb_673
Hb_674
Hb_675
Mg_416
Mg_417
Mg_418
Mg_419
Mg_420
Mg_446
Mg_447
Mg_448
Mg_449
Mg_450
Mg_476
Mg_478
Mg_479
Mg_480
Mg_508
Mg_509
Mg_536
Mg_537
Mg_539
Mg_540
Mg_567
Mg_568
Mg_570
Mg_596
Mg_597
Mg_598
Mg_599
Mg_627
Mg_628
Mg_629
Mg_630
Mg_658
Mg_660
Mg_685
Mg_686
Mg_687
Mg_688
OC_421
OC_422
OC_423
OC_424
OC_425
OC_451
OC_452
OC_453
OC_454
OC_455
OC_481
OC_482
OC_483
OC_484
OC_485
OC_511
OC_512
OC_513
OC_514
OC_515
OC_541
OC_542
OC_543
OC_544
OC_545
OC_571
OC_572
OC_573
OC_574
OC_575
OC_601
OC_602
OC_603
OC_604
OC_605
OC_631
OC_632
OC_633
OC_634
OC_635
OC_661
OC_662
OC_663
OC_664
OC_665
OC_689
OC_690
RB_427
RB_428
RB_430
RB_456
RB_457
RB_458
RB_459
RB_460
RB_487
RB_488
RB_489
RB_490
RB_516
RB_519
RB_546
RB_547
RB_548
RB_549
RB_550
RB_576
RB_577
RB_579
RB_580
RB_606
RB_607
RB_608
RB_610
RB_636
RB_637
RB_638
RB_639
RB_640
RB_668
RB_669
RB_670
RB_691
RB_692
RB_693
RB_694
RB_695
SR_411
SR_412
SR_413
SR_414
SR_415
SR_441
SR_442
SR_443
SR_445
SR_471
SR_472
SR_473
SR_474
SR_475
SR_501
SR_502
SR_504
SR_531
SR_532
SR_534
SR_535
SR_562
SR_564
SR_565
SR_591
SR_594
SR_595
SR_622
SR_624
SR_625
SR_651
SR_652
SR_653
SR_655
SR_682
SR_683
SR_684
SH_2135
MG_2136
RB_2137
RB_2138
SH_2139
MG_2140
MG_2141
MG_2142
SH_2143
SH_2144
MG_2145
RB_2146
MG_2147
SH_2148
RB_2149
MG_2150
RB_2151
RB_2152
SH_2153
SH_2154
RB_2155
SH_2156
SH_2157
MG_2158
MG_2159
SH_2160
MG_2161
MG_2162
SH_2163
RB_2164
RB_2165
SH_2166
SH_2167
MG_2168
MG_2169
MG_2170
SH_2171
RB_2172
SH_2173
RB_2174
F_5001
NBH_5002
P_5003
SLC_5004
Syc_5005
H_5006
M_5007
F_5008
NBH_5009
P_5010
SLC_5011
Syc_5012
M_5013
F_5014
NBH_5015
P_5016
SLC_5017
Syc_5018
H_5019
M_5020
F_5021
NBH_5022
P_5023
SLC_5024
H_5025
M_5026
F_5027
NBH_5028
P_5029
SLC_5030
Syc_5031
H_5032
M_5033
F_5034
NBH_5035
P_5036
Syc_5037
H_5038
M_5039
F_5040
NBH_5041
P_5042
SLC_5043
Syc_5044
H_5045
M_5046
F_5047
NBH_5048
SLC_5049
Syc_5050
H_5051
M_5052
F_5053
NBH_5054
P_5055
SLC_5056
Syc_5057
H_5058
M_5059
F_5060
P_5061
SLC_5062
Syc_5063
H_5064
M_5065
F_5066
NBH_5067
P_5068
SLC_5069
Syc_5070
H_5071
M_5072
NBH_5073
P_5074
SLC_5075
Syc_5076
H_5077
M_5078
F_5079
NBH_5080
P_5081
SLC_5082
Syc_5083
H_5084
F_5085
NBH_5086
P_5087
SLC_5088
Syc_5089
H_5090
M_5091
F_5092
NBH_5093
P_5094
SLC_5095
Syc_5096
P_5097
SLC_5098
Syc_5099
H_5100
M_5101
F_5102
NBH_5103
P_5104
SLC_5105
Syc_5106
H_5107
M_5108
NBH_5109
P_5110
SLC_5111
Syc_5112
H_5113
M_5114
F_5115
NBH_5116
P_5117
SLC_5118
Syc_5119
H_5120
F_5121
NBH_5122
P_5123
SLC_5124
Syc_5125
H_5126
M_5127
F_5128
NBH_5129
P_5130
SLC_5131
Syc_5132
M_5133
F_5134
NBH_5135
P_5136
SLC_5137
Syc_5138
H_5139
M_5140
F_5141
NBH_5142
P_5143
SLC_5144
H_5145
M_5146
F_5147
NBH_5148
P_5149
SLC_5150
Syc_5151
H_5152
M_5153
F_5154
NBH_5155
P_5156
Syc_5157
H_5158
M_5159
F_5160
NBH_5161
P_5162
SLC_5163
Syc_5164
H_5165
M_5166
F_5167
NBH_5168
SLC_5169
Syc_5170
H_5171
M_5172
F_5173
NBH_5174
P_5175
SLC_5176
Syc_5177
H_5178
M_5179
F_5180
P_5181
SLC_5182
Syc_5183
H_5184
M_5185
F_5186
NBH_5187
P_5188
SLC_5189
Syc_5190
H_5191
M_5192
H_5193
M_5194
Syc_5200
H_5201
SLC_5207
Syc_5208
P_5214
SLC_5215
NBH_5221
P_5222
F_5228
NBH_5229
M_5230
M_5236
F_5237
NBH_5238
H_5244
M_5245
F_5246
" #list of duplicated sample_names - get from R

for i in $prefixes; do cat "$i"* > "$i".merged.txt ; done

for i in $prefixes; do rm "$i".fq.gz; rm "$i".1.fq.gz; done
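One detail worth noting: the merge above concatenates gzipped files directly with cat, which is valid because a gzip stream may contain multiple members back to back. A Python sketch of that property (illustrative only):

```python
import gzip

# Two independently gzipped "per-lane" read chunks for the same fish...
part1 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
part2 = gzip.compress(b"@read2\nTTTT\n+\nIIII\n")

# ...byte-concatenated, which is what `cat "$i"* > merged` does...
merged = part1 + part2

# ...decompress to the concatenation of the original contents, because
# a gzip stream may contain several members back to back.
print(gzip.decompress(merged) == b"@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n")
```

So no decompress/recompress round trip is needed when pooling a fish's reads across lanes.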

Read Mapping

Now that all reads have been demultiplexed and cleaned, and duplicate sequencing runs for individuals have been merged, we align to the F. heteroclitus genome and save the results as individual-level BAM files.

Summary

  • downloaded genome v3.0.2 from ENSEMBL
  • indexed with bwa
  • aligned all reads to f_het genome using bwa-mem and default options
#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00

#SBATCH --cpus-per-task=38

# set partition/queue to use
#SBATCH --partition=week-long-cpu

# set name of job
#SBATCH --job-name=bwa_default

# set name of output file
#SBATCH --output=bwadefault.out

# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu


source /opt/samtools-1.6/source_me

for i in `find ../cleaned_tags/ -name '*.gz'`
do
/opt/bio-bwa/bwa mem -t 38 ../genome/f_het_genome_3.0.2.fa.gz ../cleaned/${i} | /opt/samtools-1.6/bin/samtools view -@ 16 -bSu - | /opt/samtools-1.6/bin/samtools sort -@ 16 - -o ./${i:0:-6}.$

done

BAM stats

Next is a QC check on the mapping process.

What are the depths of mapped reads (from bam files)?

#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00

#SBATCH --cpus-per-task=1

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of job
#SBATCH --job-name=bam_depths

# set name of output file
#SBATCH --output=bam_depths

# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu

source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools

for i in *bam
do
$samtools depth $i | awk -v var="$i" '{sum += $3; n++} END {print " '${i%.bam}' ", sum / n}' >> depths.txt
done
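The awk one-liner computes mean depth over covered positions; a Python sketch of the same calculation (illustrative only; positions with zero coverage are not reported by samtools depth and so do not enter the mean):

```python
# samtools depth emits "chrom  pos  depth" lines; the mean of column 3 is
# the sample's average depth over covered positions.
def mean_depth(depth_lines):
    total = n = 0
    for line in depth_lines:
        total += int(line.split()[2])   # column 3: depth at this position
        n += 1
    return total / n

example = ["chr1 100 12", "chr1 101 14", "chr2 55 10"]
print(mean_depth(example))
```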
bam_depths<-read.table("./snp_calling/QC/depths.txt", sep = " ")
bam_depths<-bam_depths[,-c(1,3)]
colnames(bam_depths)<-c("fish_id", "depth")
ggplot(data = bam_depths)+geom_density(aes(x = depth))+labs( x = "Sequencing depth", y = "Density")
Read Mapping Fig. 1: Mean depth of mapped reads in BAM files, averaged across mapping loci and individuals


#we could also easily split up the depth file here to look at population level variation in depth

Most individuals have greater than 10x sampling depth on average across all loci.

However, the number of mapped loci may vary across individuals, so let’s also look at the overall number of mapped reads per sample for a better idea of the distribution of reads per individual in the dataset.

#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00

#SBATCH --cpus-per-task=1

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of job
#SBATCH --job-name=bam_mapped_reads

# set name of output file
#SBATCH --output=bam_mapped_reads.out


source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools

for i in *bam
do
{ $samtools view -c -F 4 $i; echo ${i%.bam} ; } | tr "\n" " " >> mapped_reads.txt
done
mapped_reads<-read.table("./snp_calling/QC/mapped_reads.txt", sep = " ")
ggplot(data = mapped_reads)+geom_histogram(aes(x = log10(mapped_reads[,1])))+labs(x = "log10 reads")
Read Mapping Fig. 2: Total number of reads per individual


Read Mapping Fig. 2 highlights the need for a genotype-likelihood based approach. Even excluding outlier individuals, sampling depth varies 20-50 fold among samples. This figure also shows that Read Mapping Fig. 1 is misleading: the similarity in read depth depicted there may be due to dropout of loci in individuals with low depth. Also of note is the roughly 5% of individuals with extremely poor coverage (fewer than ~50,000 reads).

F. grandis

Keeping grandis in the analysis may be useful for reconstructing the ancestral state of the genome and therefore creating an unfolded site frequency spectrum (SFS), but first we need to justify maintaining them in the genotyping run.

How well do the grandis reads align to the fundulus reference genome? Let’s compare the bam files:

colnames(mapped_reads) <- c("reads", "sample")
mapped_reads$pop<-str_extract(mapped_reads$sample, "[^_]+")
#mapped_reads %>%
#  group_by(pop) %>%
#  summarize(mean = mean(reads), sd = sd(reads), n = n())
# can uncomment this if a reviewer wants this picture of the variation later

ggplot()+geom_density(data = mapped_reads[(mapped_reads$pop!="grandis"),],aes(x=log10(reads)))+geom_density(data = mapped_reads[(mapped_reads$pop=="grandis"),],aes(x=log10(reads)), color = "red") + labs(x = "Log10 Reads") 
Read Mapping Fig. 3: Total number of reads (log10) for F. heteroclitus (black) vs F. grandis


This looks great: a similar number of reads map to the genome regardless of the species the reads originate from, suggesting that BWA-MEM tolerates the mismatches owing to species-level variation well enough to map the reads.

Raw read counts aren’t ideal on their own, though; let’s try to figure out the proportion of reads successfully mapped to the reference genome.

#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00

#SBATCH --cpus-per-task=1

# set partition/queue to use
#SBATCH --partition=day-long-cpu

# set name of job
#SBATCH --job-name=bam_mapped_reads

# set name of output file
#SBATCH --output=bam_mapped_reads.out



source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools

for i in *bam
do
{ $samtools view -c -F 0 $i; echo ${i%.bam} ; } | tr "\n" " " >> total_reads.txt
done

#this script pulls the total reads (not just mapped reads) from the bam files using the samtools bitwise flag 0
#needed to fix this output quickly using regex - replaced every second space with a linebreak
#make new dataframe for proportion of reads and plot
total_reads <- read.table("./snp_calling/QC/total_reads.txt", sep = " ")
# make column of total reads, then proportion reads, plot again
colnames(total_reads) <- c( "total", "sample")
mapped_reads <- cbind(mapped_reads, total_reads$total)
colnames(mapped_reads)[4] <- c("total_reads")

mapped_reads$proportion <- mapped_reads$reads/mapped_reads$total_reads

ggplot()+geom_density(data = mapped_reads[(mapped_reads$pop!="grandis"),],aes(x=log10(proportion)))+geom_density(data = mapped_reads[(mapped_reads$pop=="grandis"),],aes(x=log10(proportion)), color = "red") + labs(x = "Log10 Proportion Mapped") 
Read Mapping Fig. 4: Total proportion of mapped reads (log10) for F. heteroclitus (black) vs F. grandis


A similar proportion of reads map regardless of species, suggesting the divergence between the species is not so high that it will cause major issues to include grandis.

Population Structure

First final results! This section details the population genetic structure of the species range using PCA (implemented in PCAngsd) and NGSadmix, a maximum-likelihood admixture estimation method similar to STRUCTURE.

PCA

The PCA relies on a covariance matrix estimated by PCAngsd from the genotype likelihoods. This covariance matrix is then subjected to eigen-decomposition to produce the PCA of genetic distance among samples.
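A toy numpy sketch of this step (random data standing in for the real 1371 x 1371 PCAngsd matrix): the eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue, give the PC axes, and the eigenvalues give the variance each axis explains.

```python
import numpy as np

# Toy sample-by-sample covariance matrix standing in for the PCAngsd output.
rng = np.random.default_rng(0)
G = rng.normal(size=(4, 10))           # 4 "individuals", 10 "loci"
C = np.cov(G)                          # 4 x 4 covariance among individuals

vals, vecs = np.linalg.eigh(C)         # eigh: symmetric eigen-decomposition
order = np.argsort(vals)[::-1]         # sort eigenvalues descending
vals, vecs = vals[order], vecs[:, order]

# Column k of `vecs` is PC k; eigenvalues give the variance explained.
explained = vals / vals.sum()
print(np.round(explained, 3))
```

The same proportions of variance would appear in the screeplot further down.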

#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00

#SBATCH --cpus-per-task=38

# set partition/queue to use
#SBATCH --partition=week-long-cpu

# set name of job
#SBATCH --job-name=pcangsd_1.3

# set name of output file
#SBATCH --output=pcangsd_1.3.out



source /home/ddayan/software/pcangsd/bin/activate
python /home/ddayan/software/pcangsd/pcangsd.py -beagle ../Results/gl_1.3.beagle.gz -o pcangsd.gl_1.3 -threads 38 -minMaf 0.0

PCANGSD run info:
Read 1371 samples and 73375 sites

Estimating population allele frequencies EM (MAF) converged at iteration: 12

Estimating covariance matrix Using 5 principal components (MAP test)

cov_mat<-read.table("./pcangsd/pcangsd.gl_1.3.cov", sep = " ")

filenames <- mapped_reads[! mapped_reads$sample %in% bad_inds,]
filenames <- filenames[!grepl("grandis", filenames$sample),]
filenames <- filenames[-c(1372, 1373),]


fish_names =  gsub(".merged", "", filenames$sample)

pc2<-prcomp(cov_mat)

save(pc2, file = "./pcangsd/pca2.R")

pc_inds<-as.data.frame(pc2$x)
row.names(pc_inds)<-fish_names
pc_inds$pop<-gsub("_\\d+", "", fish_names)

# write.table(pc_inds, file = "./pcangsd/pca.txt", quote = FALSE) #commented to avoid overwriting, run if doing first run through analysis


#caution: this deletes all NBH sub-populations except NBH itself, fix later
pc_inds <- merge(pc_inds, sites[,c("Abbreviation", "Latitude")], by.x = "pop" , by.y ="Abbreviation")

ggplot(data = pc_inds) + geom_point(aes(x = PC1, y = PC2, color = Latitude))+scale_color_gradientn(colours = rainbow(5))#+stat_ellipse(aes(x = PC1, y = PC2, group = pop),type = "norm")
Results Fig. 1: Principal component analysis of genetic variation. Populations colored by latitude.


plot_ly(x=pc_inds$PC1, y=pc_inds$PC2, z=pc_inds$PC3, type="scatter3d", mode="markers", color=pc_inds$pop)

Results Fig. 2: Principal component analysis of genetic variation, including the 3rd PC. Hover mouse over points for population ID; population codes are as in the table “sites” under the section “Data description” at the top of the document.

screeplot(pc2)
Screeplot of PCA


PCA Summary
The final PCA corroborates previous descriptions of F. heteroclitus population genetic structure. There is a break centered at northern New Jersey between two distinct clusters; within each cluster, a pattern of isolation by distance dominates the structure.

Admixture

NGSadmix is used to estimate ancestry for K = 1-6 with 10 replicate runs, and the best K is determined by the Evanno method (not included in notebook - online app). Run details for publication-quality and other-K plotting are in the “ngs_admix” directory.
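The Evanno statistic itself is simple to compute; a hedged Python sketch (toy log-likelihoods, not the real NGSadmix output; the notebook used an online app for this step):

```python
import statistics

# Evanno delta K: for each interior K,
#   delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)),
# where L(K) is the mean log-likelihood over replicate runs and the sd is
# taken across those replicates.
def delta_k(loglik):
    """loglik: dict mapping K -> list of replicate log-likelihoods."""
    ks = sorted(loglik)
    mean = {k: statistics.mean(loglik[k]) for k in ks}
    sd = {k: statistics.stdev(loglik[k]) for k in ks}
    return {k: abs(mean[k + 1] - 2 * mean[k] + mean[k - 1]) / sd[k]
            for k in ks[1:-1]}          # endpoints have no second difference

toy = {1: [-1000, -1001], 2: [-800, -802], 3: [-790, -795], 4: [-788, -780]}
print(delta_k(toy))
```

In this toy example the sharp likelihood gain from K = 1 to K = 2 makes delta K peak at K = 2, the same qualitative pattern reported below for the real data.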

Below is a structure plot of the best K according to the Evanno method (K = 2).

#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00

#SBATCH --cpus-per-task=10

# set partition/queue to use
#SBATCH --partition=week-long-cpu

# set name of job
#SBATCH --job-name=ngsadmix_1.3_k9

# set name of output file
#SBATCH --output=ngsadmix_1.3_k9


NGSadmix="/opt/angsd0.928/angsd/misc/NGSadmix"
BEAGLE="/home/ddayan/fundulus/ANGSD/Results/gl_1.3.beagle.gz"

Out_Prefix="1.3_k9"
K="9"

$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.1 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.2 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.3 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.4 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.5 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.6 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.7 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.8 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.9 -P 10 -minMaf 0
$NGSadmix -likes $BEAGLE -K $K -o $Out_Prefix.10 -P 10 -minMaf 0
#Plot admixture proportions for a single run of k = 2

#make popfile
pops<- as.data.frame(cbind(fish_names, gsub("_\\d+", "", fish_names)))
colnames(pops) <- c("sample", "pop")

#plot an admixture plot
q<-read.table("./ngs_admix/sandbox/outFileName.qopt")
    #set order of populations by latitude

q$sample <- pops$sample
q$pop <- pops$pop

pops$pop <- factor( as.character(pops$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL") )
ord <- order(pops$pop)

q <- q[ord,]

plot_data <- q %>% 
  mutate(id = sample) %>% 
  gather('cluster', 'prob', V1:V2) %>% 
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assignment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assignment_prob)) %>% 
  ungroup() %>% 
  dplyr::arrange(factor( as.character(pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL")  ))

#reorder levels
plot_data$pop <- factor( as.character(plot_data$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "Mg", "CT", "SR", "BP", "HB", "Hb", "H", "F", "M", "Syc", "NBH", "P", "SLC", "ME", "MDIBL")  )



#ggplot(plot_data, aes(id, prob, fill = cluster)) +
#  geom_col() +
#  theme_classic()

ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col() +
  facet_grid(~pop, scales = 'free', space = 'free')+ theme(panel.spacing=unit(0.01, "lines"), axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(), legend.position = "none")

Data are also formatted for CLUMPAK visualization below


(for log in `ls *.log`; do grep -Po 'like=\K[^ ]+' $log; done) > logfile
logs <- as.data.frame(read.table("./ngs_admix/logfile_likelihoods.txt"))
logs$K <- c(rep("1", 10), rep("2", 
    10), rep("3", 10), rep("4", 10), rep("5", 10), rep("6", 10))
colnames(logs)[1] <- "likelihood"
write.table(logs[, c(2, 1)], "/ngs_admix/logfile_likelihood_formatted.txt", row.names = F, 
    col.names = F, quote = F)

Note - the delta K method suggests the best K is 2, but prob(K) is maximized at K = 6, suggesting runs at higher K may be necessary.

Admixture Results
The STRUCTURE plot above agrees with previous analyses. The primary aspect of population structure is the break north of New Jersey (i.e. the best K is 2 and splits samples into two groups north and south of the Hudson Basin). Also of note in the K = 2 plot is the evidence of introgression of northern ancestry into the admixture zone throughout New Jersey (in populations SH, RB, OC, and MG). Surprisingly, there is also some introgression of southern alleles into the southern extremes of the northern population in Connecticut and Rhode Island (CT and SR).

Landscape Genetics

require(vegan)
require(ade4)
require(adespatial)
require(SoDA)
require(marmap)

Rationale:
Here the goal is to use multivariate genotype-environment association analysis to decompose the genetic variation along the cline into adaptive and neutral components, by finding genetic variation that correlates with putatively selective environmental variables after correcting for spatial autocorrelation. We use a cRDA - redundancy analysis of components. Redundancy analysis is a multivariate constrained ordination that finds linear combinations of multiple explanatory variables (in this case temperature variation and spatial autocorrelation) that are “redundant with” (i.e. correlate with) linear combinations of the response variables (in this case genetic variation). In our case the response variable is not individual genotypes, but components of genetic variation among individuals - a PCA of the covariance matrix produced by PCAngsd. This type of RDA - termed a cRDA - is less powerful than a traditional RDA for finding outlier loci associated with the environmental variables, because individual SNPs must subsequently be identified as those correlated with the significantly constrained axes; i.e. the RDA is on components of genetic variation, not directly on the SNPs. This approach is preferable for our purposes because (a) we are more interested in identifying the proportion and patterns of polygenic variation that may be due to selection and demography than in identifying the actual SNPs, and (b) conducting RDA on SNP-level variation would require calling SNPs rather than relying on the genotype probabilities used to calculate the covariance matrix.
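A bare-bones numpy sketch of the RDA idea (toy matrices, not the actual analysis, which would use something like vegan's rda() in R): regress the response matrix on the explanatory matrix, then take a PCA of the fitted values; those are the constrained axes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))                      # explanatory vars (env + space)
Y = X @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(n, 5))  # responses

X1 = np.hstack([np.ones((n, 1)), X])             # add intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)       # multivariate least squares
Y_hat = X1 @ B                                   # variation "redundant with" X

# PCA of the fitted values yields the constrained (RDA) axes; the ratio of
# their summed variance to the total variance in Y is the constrained fraction.
vals = np.linalg.svd(Y_hat - Y_hat.mean(0), compute_uv=False) ** 2
constrained = vals.sum() / ((Y - Y.mean(0)) ** 2).sum()
print(round(constrained, 3))
```

Because the toy responses are built mostly from X, nearly all of their variance lands on the constrained axes; with real genetic data the constrained fraction quantifies how much variation the spatial and environmental predictors explain.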

The genetic variation (principal components) is modelled as an effect of two sets of variables: (1) spatial variables that capture neutral structure owing to isolation by distance and (2) environmental variables with a priori support for driving adaptive evolution in this species. Further details for these variables appear in the relevant sections below.

Spatial and Environmental Data

First we set up the dataset to run the RDA

Spatial Variables
The spatial variables are distance-based Moran’s eigenvector maps (db-MEMs). db-MEM decomposes the spatial structure into a set of orthogonal eigenfunctions, thereby providing better resolution of autocorrelation across spatial scales.

To calculate the db-MEMs we use a distance matrix based on travel distance among sites. That is, rather than using distance matrices based on latitude and longitude, we build a more biologically relevant model where spatial distance among populations is determined by connections via shallow water (< 15 meters). This is accomplished via the marmap package in R.
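The plotting code below relies on objects built earlier in the notebook (atlantic, trans, path). A minimal sketch of how they could be created with marmap follows; the bounding box, resolution, and thread of the workflow are illustrative assumptions, with only the 15 m depth limit taken from the text above.

```r
library(marmap)

# download bathymetry for the sampled coastline
# (bounding box and resolution here are illustrative assumptions)
atlantic <- getNOAA.bathy(lon1 = -82, lon2 = -64, lat1 = 30, lat2 = 46,
                          resolution = 4)

# transition matrix allowing travel only through shallow water (0 to -15 m)
trans <- trans.mat(atlantic, min.depth = 0, max.depth = -15)

# least-cost paths among sites (longitude, latitude columns), for plotting;
# res = "dist" instead returns the pairwise travel distance matrix in km
path <- lc.dist(trans, sites[, c(3, 2)], res = "path")
```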

plot(atlantic, image = TRUE, land = TRUE, lwd = 0.03, bpal = list(c(0, max(atlantic), greys),c(min(atlantic), 0, blues)))
points(sites[,c(3,2)], pch = 21, col = "blue", bg = col2alpha("blue", .9),cex = 1.2)
invisible(lapply(path, lines, col = "orange", lwd = 2, lty = 1))

Now that more ecologically relevant distances are calculated, we estimate the distance-based Moran’s eigenvector maps.

#prep site data for dbmem
  #merge all NBH inds to a single pop and convert to all caps
pops <- pops %>%  
  mutate(pop = as.character(pop)) %>%
  mutate(pop = replace(pop, pop == "F", "NBH")) %>%
  mutate(pop = replace(pop, pop == "P" , "NBH")) %>%
  mutate(pop = replace(pop, pop ==  "Syc" , "NBH")) %>%
  mutate(pop = replace(pop, pop ==  "H", "NBH")) %>%
  mutate(pop = replace(pop, pop == "Mg", "MG")) %>%
  mutate(pop = replace(pop, pop == "Hb", "HB"))


ind_sites <- pops %>%
  left_join(., sites, by = "pop")

distance_matrix <- lc.dist(trans, ind_sites[,c(4,3)], res = "dist")

#calculate dbmem and plot
dbmem_pos<-dbmem(distance_matrix, MEM.autocor = "positive")
plot(attributes(dbmem_pos)$values)
abline(h = 0)
Figure Caption: Plot of db-MEM Moran’s I values

The first 8 db-MEMs have Moran’s I values above 0 (i.e. they describe positive spatial autocorrelation) and are retained as explanatory variables in the RDA.
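As a sketch of how the retained set can be made explicit (note that dbmem() with MEM.autocor = "positive" already returns only the positively autocorrelated eigenfunctions, so here the count simply confirms the choice of 8):

```r
# count MEMs with positive eigenvalues (positive spatial autocorrelation)
# and keep them as explanatory variables
n_pos <- sum(attributes(dbmem_pos)$values > 0)   # 8 in this dataset
mem_keep <- as.data.frame(dbmem_pos)[, seq_len(n_pos), drop = FALSE]
```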

Temperature

The environmental data are daily sea surface temperatures (SSTs) from the NOAA OI SST V2 High Resolution Dataset.

We extracted 10 years of daily optimum interpolated sea surface temperatures (OISSTs) from NOAA/OAR/ESRL PSD (Reynolds et al. 2007) for each of the sampling locations and calculated three derived environmental variables (mean temperature, mean annual minimum, and mean annual maximum), then calculated a PCA on these variables. Scripts for this data extraction are not included in this R notebook.

#env variable for pops
env<-read.table("./environmental/temp_summary.txt", sep = "\t", header = T)
env <- env %>%
  left_join(., pops, by = "pop")

#pca
env_pc <- prcomp(env[,c(2,3,4)], scale = TRUE)
biplot(env_pc)
plot(env_pc)
env <- cbind(env, env_pc$x)
Temperature variation PCA and associated screeplot

pairs.panels(env[,c(2,3,4,5,7,8,9)], scale = TRUE) # pairs.panels() is from the psych package
Comparing temperature variables

After all of that, we chose to keep only two environmental variables, mean_annual_min and mean_annual_max, because tenyearmean is highly correlated with both of the other variables.

Genetic Data

We use the covariance matrix calculated by PCAngsd to conduct a PCA that describes the genetic variation among samples. We then choose the number of PCs to retain using the Kaiser-Guttman criterion.

#pca
#to load pca: attach("./pcangsd/pca.r")
#snp_pca<-read.table("./pcangsd/pca.txt", header = T, sep= "\t")

#kaiser guttman: retain PCs whose variance exceeds the mean eigenvalue
cutoff <- mean(pc2$sdev^2)
kg <- sum(pc2$sdev^2 > cutoff)

#pcs to keep (kg = 103 for this dataset)
snp_pcs <- pc2$x[, 1:kg]

We keep the first 103 PCs to fit our RDA on.
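As an aside, the Kaiser-Guttman rule used above can be illustrated on toy data (purely illustrative, not project data): keep components whose variance exceeds the mean eigenvalue, which equals 1 for a PCA of scaled variables.

```r
# toy PCA: 20 samples x 10 variables of random noise
set.seed(1)
toy <- prcomp(matrix(rnorm(200), nrow = 20), scale. = TRUE)

# Kaiser-Guttman: retain components with above-average variance
cutoff <- mean(toy$sdev^2)        # equals 1 when variables are scaled
keep <- sum(toy$sdev^2 > cutoff)
toy_pcs <- toy$x[, 1:keep]
```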

Final Dataset

To make coding easier, we merge the explanatory variables into a single data frame with variable names.

ex_var <- cbind(dbmem_pos, env[,c(2,3)])

RDA

In this section we run and visualize the RDA results

First we run variable selection procedures and settle on the final model. Output for this section is too long to include in the notebook, but is stored in ./rda/rda_results_log.txt

#global rda
rda_null <- rda(snp_pcs ~ 1, data = ex_var)
rda_full <- rda(snp_pcs ~ . , data = ex_var)

#check that the full model is significant
anova(rda_full) # p = 0.001: significant - proceed to variable selection

#how much variance is explained?
adjR2.rda_full <- RsquareAdj(rda_full)$adj.r.squared # 76%

#variable selection
ordiR2 <- ordiR2step(rda_null, scope = formula(rda_full, data = ex_var))
ordiR2$anova #MEM7 not included in variable selection
ordi <- ordistep(rda_null, scope = formula(rda_full, data = ex_var))
rda.fs <- forward.sel(Y = snp_pcs, X = ex_var, adjR2thresh = adjR2.rda_full)

vif.cca(rda_full)

Then we fit the final model and test for significance of the constrained axes and of each explanatory variable on the margin. We also test the global significance of the models. Detailed outputs are available in ./rda/rda_results_log.txt

#final model
rda_final <- rda(snp_pcs ~ mean_annual_max + mean_annual_min + MEM1 + MEM2 + MEM3 + MEM4 + MEM5 + MEM6 + MEM7 + MEM8, data = ex_var)

#testing significance of constrained axes (how many are interesting)
rda_axis <- anova.cca(rda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust (rda_axis$`Pr(>F)`, method = 'holm')

#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <-  anova.cca(rda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust (rda_expl$`Pr(>F)`, method = 'holm')

#variance partitioning
vp <- varpart(snp_pcs , ex_var[,c(9,10)] , ex_var[,c(1:8)])

#controlling for spatial autocor
rda_ind_con_spatial<-rda(snp_pcs ~ as.matrix(env[,c(2)]) + Condition((as.matrix(dbmem_pos)[,c(1:8)])))

#controlling for temp
rda_ind_con_temp<-rda(snp_pcs ~ Condition(as.matrix(env[,c(2)])) + (as.matrix(dbmem_pos)[,c(1:8)]))

#anovas
aov_global <- anova(rda_final)
aov_con_spatial <- anova(rda_ind_con_spatial) 
aov_con_temp <- anova(rda_ind_con_temp)

RDA Results Summary

We fit a global model of all variables. This full model was significant. We then performed forward variable selection using three approaches, which yielded three different results: ordistep (vegan, uses AIC), ordiR2step (vegan, uses adjusted R2 and p-values), and forward.sel (adespatial). ordistep calls for the full model. ordiR2step retains all MEMs and mean_annual_min. forward.sel retains all MEMs and no temperature variables. Variance inflation factors are low for all explanatory variables (< 3.4). For the sake of ecological interpretability we retain both environmental variables, i.e. we went with AIC.

The first 6 RDA axes are significant (i.e. they explain significantly more variance than the equivalent unconstrained axes; Holm-adjusted p < 0.05). All explanatory variables are also significant (i.e. each variable explains significantly more variance once the data are conditioned on all other variables; Holm-adjusted p < 0.05).

Collectively these results suggest that there is a significant relationship between the environmental temperature variables and the genetic covariance among samples once the genetic data are conditioned on the effect of spatial autocorrelation, AND vice versa. 76.5% of the total variation in the genetic dataset (or at least in the first 103 of 1371 principal components of it) was constrained by the RDA.

Using variance partitioning, the relative effect of spatial autocorrelation and temperature could be estimated. Adjusted R2 for the effect of temperature, once spatial autocorrelation is accounted for, is 0.1%. In contrast 64.7% of the genetic dataset was explained by spatial autocorrelation alone. 11.4% of variation was explained jointly and could not be parsed. 23.7% was residual.
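The fractions reported above can be read directly from the varpart object; a short sketch (labels follow vegan's convention: [a] = temperature alone, [c] = space alone, [b] = shared, [d] = residual):

```r
# adjusted R^2 for each individual fraction of the partition
vp$part$indfract

# Venn-diagram summary of the same fractions
plot(vp, Xnames = c("temperature", "space"))
```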

Examining the triplot suggests that the first db-MEM separates northern from southern populations at the New Jersey break.

RDA figures and results tables

## tri plot
#site score
rda_sum<-summary(rda_full, axes = c(6))
scores <- as.data.frame(rda_sum$sites)
scores$pop<-pops$pop
scores$pop <- factor( as.character(scores$pop), levels=c("GA", "SC", "WNC", "NC", "KC", "SH", "RB", "OC", "MG", "CT", "SR", "BP", "HB",  "NBH", "P", "SLC", "ME", "MDIBL")  )

#species scores
arrows<-as.data.frame(rda_sum$biplot)

#plot
ggplot() +
  geom_point(aes(RDA1, RDA2, color = pop), data = scores) +
  geom_segment(data = arrows, aes(x = 0, y = 0, xend = RDA1, yend = RDA2),
               arrow = arrow(length = unit(0.2, "cm")), color = "red") +
  geom_text(data = arrows,
            aes(x = RDA1 + RDA1 * 0.2, y = RDA2 + RDA2 * 0.2,
                label = c("MEM1", "MEM2", "MEM3", "MEM4", "MEM5", "MEM6",
                          "MEM7", "MEM8", "MaxTemp", "MinTemp")),
            size = 4, color = "red") +
  theme_bw() +
  xlab("Redundant Axis 1 - 58%") +
  ylab("Redundant Axis 2 - 17%")
RDA Triplot

## screeplot

constrained_percent <- rda_full$CCA$eig/rda_full$tot.chi*100
unconstrained_percent <- rda_full$CA$eig/rda_full$tot.chi*100
barplot (c(constrained_percent, unconstrained_percent), col = c(rep ('red', length (constrained_percent)), rep ('black', length (unconstrained_percent))), las = 2, ylab = '% variation')
Screeplot of RDA - red is constrained axes, black is unconstrained axes

minor notes for deeper analysis if wanted
* if we conduct the RDA on all PCs (not just those exceeding the Kaiser-Guttman criterion) we can relate the RDA results more clearly to FST, i.e. we can see how much of the total variation along the cline is explained by the RDA and compare this to results from AMOVA and so on
* cRDA (RDA on the components of genetic variation) seems to be less powerful at identifying adaptive loci, largely because these loci can load onto many PCs; we could do a locus-by-locus RDA, but this would present its own challenges - something to consider

SFS

Work using the SFS is contained in a separate R notebook; all code below is draft for future work.

SFS Analysis Outline
* make a consensus sequence from F. grandis reads to polarize the SAF (-doFasta 2)
* estimate SAFs for each population and metapopulation (-doSaf)
* estimate per-population SFS
* estimate 2dSFS for each pair of populations
* 1d and 2dSFS used in downstream analyses: per-site summary statistics, global Fst, and dadi

questions:
* is it possible to estimate the SAF without re-estimating GLs (we already have them)?
* what filtering is necessary for the population and 2d SFS? When there are so few individuals a MAF of 1% becomes meaningless; should we filter to a higher MAF, or at least increase the minimum number of individuals for a site to be retained (i.e. the SAF estimate is likely to be very bad at sites with only 2 individuals)?
* how should we pool sampling locales into metapopulations (i.e. it’s clear that all the NBH populations should be combined into a single pop, but what about all NJ populations or all ME populations)? What spatial or genetic scale should be the cutoff for pooling populations?

1d SFS estimation

ancestral sequence

Polarize the SAF by an outgroup (F. grandis): use -doFasta 2 with -doCounts 1 and the same filters as $FILTERS from the GL run.


#!/bin/bash

# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00

#SBATCH --cpus-per-task=38

# set partition/queue to use
#SBATCH --partition=week-long-cpu

# set name of job
#SBATCH --job-name=consensus

# set name of output file
#SBATCH --output=consensus.out


ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"

FILTERS="-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -minMapQ 20"
DOS="-doFasta 2 -doCounts 1 -dumpCounts 2" 

$ANGSD -P 38 -b ../meta_data/grandis.bams -ref $REF -out ./Results/grandis_consensus \
  $FILTERS $DOS

SAF 1 big population
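A draft sketch for this step (the BAM list all_pops.bams, the thread count, and the realSFS path are assumptions; filters mirror the consensus run above):

```bash
#!/bin/bash
# draft: SAF for a single (meta)population, polarized by the F. grandis
# consensus, followed by 1d SFS estimation with realSFS
ANGSD="/opt/angsd0.928/angsd/angsd"
REALSFS="/opt/angsd0.928/angsd/misc/realSFS"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
ANC="./Results/grandis_consensus.fa.gz"   # from the -doFasta 2 run above

FILTERS="-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -minMapQ 20"

# GLs are re-estimated here because -doSaf is run together with -GL
$ANGSD -P 8 -b ../meta_data/all_pops.bams -ref $REF -anc $ANC \
  $FILTERS -GL 1 -doSaf 1 -out ./Results/all_pops

# maximum-likelihood 1d SFS from the SAF index
$REALSFS ./Results/all_pops.saf.idx -P 8 > ./Results/all_pops.sfs
```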

2d SFS
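Draft sketch (SAF file names popA/popB are placeholders): realSFS estimates the joint SFS for a pair of populations directly from their SAF indexes:

```bash
# 2d SFS for one population pair; the output .ml file later serves as the
# prior for per-site Fst estimation
/opt/angsd0.928/angsd/misc/realSFS \
  ./Results/popA.saf.idx ./Results/popB.saf.idx -P 8 \
  > ./Results/popA.popB.ml
```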

global FST
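Draft sketch (SAF and 2d SFS file names are placeholders): the realSFS fst subcommands index per-site Fst using the 2d SFS as a prior, then report the global unweighted and weighted estimates:

```bash
REALSFS="/opt/angsd0.928/angsd/misc/realSFS"

# index per-site Fst using the 2d SFS as a prior
$REALSFS fst index ./Results/popA.saf.idx ./Results/popB.saf.idx \
  -sfs ./Results/popA.popB.ml -fstout ./Results/popA.popB

# global (unweighted and weighted) Fst
$REALSFS fst stats ./Results/popA.popB.fst.idx
```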

Per site summary stats
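Draft sketch (file names and tool paths are assumptions): per-site thetas and neutrality statistics can be derived from a population's SAF and 1d SFS via realSFS saf2theta and thetaStat:

```bash
REALSFS="/opt/angsd0.928/angsd/misc/realSFS"
THETASTAT="/opt/angsd0.928/angsd/misc/thetaStat"

# per-site thetas from the SAF, using the 1d SFS as a prior
$REALSFS saf2theta ./Results/popA.saf.idx -sfs ./Results/popA.sfs \
  -outnames ./Results/popA

# neutrality statistics (Tajima's D etc.) per region/site
$THETASTAT do_stat ./Results/popA.thetas.idx
```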